import pandas as pd
import numpy as np
from lets_plot import *
from types import GeneratorType
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

LetsPlot.setup_html(isolated_frame=True)
# import your data here using pandas and the URL
dwellings = pd.read_csv('https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_denver/dwellings_denver.csv')
dwellings_ml = pd.read_csv('https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv')

# code from the last HW
# features used to predict whether a home was built before 1980
x = dwellings_ml.filter([
    "quality", "condition", "livearea", "stories", "arcstyle", "basement",
    "condition_Fair", "nocars", "numbdrm", "netprice", "numbaths", "sprice",
    "qualified_Q", "deduct", "finbsmnt", "abstrprd",
])
y = dwellings_ml['before1980']

# hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=34)

# note: the variable is named classifier_DT, but it holds a gradient boosting model
classifier_DT = GradientBoostingClassifier(max_depth=10)
classifier_DT.fit(x_train, y_train)
y_predicted_DT = classifier_DT.predict(x_test)

print("Accuracy:", metrics.accuracy_score(y_test, y_predicted_DT))
Accuracy: 0.9159938904647611
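The questions below refer to the most important variables in the model. As a minimal sketch of how such a ranking can be read from a fitted gradient boosting model, the block below uses synthetic stand-in data (not the dwellings data). The names `demo_x`, `demo_y`, and `model` are hypothetical, and the target is made to depend only on `livearea` purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (NOT the dwellings data):
# one informative column and two noise columns.
rng = np.random.default_rng(34)
n = 500
demo_x = pd.DataFrame({
    "livearea": rng.normal(2000, 500, n),
    "numbdrm": rng.integers(1, 6, n),
    "stories": rng.integers(1, 4, n),
})

# Hypothetical target that depends only on livearea.
demo_y = (demo_x["livearea"] < 2000).astype(int)

model = GradientBoostingClassifier(max_depth=3, random_state=34)
model.fit(demo_x, demo_y)

# feature_importances_ follows the column order of the training frame,
# so pairing it with the column names gives a readable ranking.
importances = (
    pd.Series(model.feature_importances_, index=demo_x.columns)
    .sort_values(ascending=False)
)
print(importances)
```

Applied to the real model, the same pattern would presumably use `classifier_DT.feature_importances_` together with `x.columns`.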
QUESTION 1
Create 2-3 charts that evaluate the relationships between each of the top 2 or 3 most important variables (as found in Unit 4 Task 2) and the year the home was built. Describe what you learn from the charts about how that variable is related to year built.
# print("Stories, and livearea and abstrprd are the top 3 ")
plot1 = ggplot(dwellings_ml, aes(x='yrbuilt', y='livearea')) +\
    geom_point(alpha=0.3) + ggtitle('Live Area vs. Year Built')
plot1.show()

plot2 = ggplot(dwellings_ml, aes(x='yrbuilt', y='abstrprd')) +\
    geom_boxplot() + ggtitle('Abstract Present vs. Year Built')
# plot1.show()
plot2.show()

print("Based on these two graphs, live area is much more strongly related to year built than abstrprd. The live area graph shows that after 1980 larger homes became more common, with some exceeding 5,000, which means that before 1980 live area was usually below 5,000. The abstrprd graph shows a few outliers, but the values stayed consistent throughout the years.")
Based on these two graphs, live area is much more strongly related to year built than abstrprd. The live area graph shows that after 1980 larger homes became more common, with some exceeding 5,000, which means that before 1980 live area was usually below 5,000. The abstrprd graph shows a few outliers, but the values stayed consistent throughout the years.
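A scatter plot can make a cutoff like 5,000 hard to judge by eye, so a small grouped summary is one way to check it. The sketch below uses a few hypothetical toy rows (not values from `dwellings_ml`), and the `era` column is an assumption introduced only for this example.

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from dwellings_ml.
demo = pd.DataFrame({
    "yrbuilt": [1950, 1965, 1978, 1985, 1999, 2005],
    "livearea": [1100, 1400, 1600, 2400, 3100, 5200],
})

# Label each home by whether it was built before 1980.
demo["era"] = demo["yrbuilt"].apply(
    lambda year: "before 1980" if year < 1980 else "1980 or later"
)

# Median and maximum live area for each group.
summary = demo.groupby("era")["livearea"].agg(["median", "max"])
print(summary)
```

On the real data, the same `groupby` could be run on the existing `before1980` column instead of a derived label.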
QUESTION 2
Create at least one other chart to examine a variable(s) you thought might be important but apparently was not. The chart should show its relationship to the year built. Describe what you learn from the chart about how that variable is related to year built. Explain why you think it was not (very) important in the model.
# Include and execute your code here
plot3 = ggplot(dwellings_ml, aes(x='yrbuilt', y='numbdrm')) +\
    geom_point(alpha=0.3) + ggtitle('Number of Bedrooms vs. Year Built')
plot3.show()

print("I thought that the number of bedrooms was going to be really important in finding out if a house was built before 1980. However, the graph shows no clear relationship between year built and the number of bedrooms, contrary to what I expected. It mainly shows that homes with 2-4 bedrooms were the most common throughout the years. Since bedroom counts look about the same before and after 1980, the variable gives the model little to separate the two groups.")
I thought that the number of bedrooms was going to be really important in finding out if a house was built before 1980. However, the graph shows no clear relationship between year built and the number of bedrooms, contrary to what I expected. It mainly shows that homes with 2-4 bedrooms were the most common throughout the years. Since bedroom counts look about the same before and after 1980, the variable gives the model little to separate the two groups.
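The "no relationship" reading from the chart can also be checked with a number. As a rough sketch, the block below computes a Pearson correlation between year built and bedroom count on hypothetical toy values (not the dwellings data), chosen so that the two columns are unrelated.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from dwellings_ml.
toy = pd.DataFrame({
    "yrbuilt": [1950, 1960, 1970, 1980, 1990, 2000],
    "numbdrm": [3, 2, 4, 4, 2, 3],
})

# Pearson correlation: values near 0 suggest no linear relationship,
# values near +1 or -1 suggest a strong one.
r = toy["yrbuilt"].corr(toy["numbdrm"])
print(f"correlation between yrbuilt and numbdrm: {r:.3f}")
```

A correlation near zero only rules out a linear trend, so it complements the scatter plot and does not replace it.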